Abstract

Random  Matrix  Theory  (RMT)  is  an  important  tool  for  detecting  correlations  in  multidimensional  time  series,  such  as  stock 
market  price  histories,  and  origin-destination  flows  in  data  networks. 

We  review  the  basic  theory  and  propose  two  novel  applications:  the  detection  of  traffic  anomalies  in  data  networks  and  natural 
language  processing. 

For traffic anomalies, the advantage of this approach is that training sets are not necessary. In the case of natural language
processing, our approach is a refinement of the standard Latent Semantic Analysis (LSA).

We  will  demonstrate  applications  to  real  traffic  from  a  data  network,  and  present  the  use  in  Natural  Language  Processing. 
Directions  for  future  work  will  be  discussed. 
1. Introduction

The increasing practicality of large-scale flow capture makes it possible to conceive of Internet traffic analysis
methods that detect and identify a large and diverse set of anomalies. However, the challenge of effectively
analyzing this massive amount of data for anomaly diagnosis is as yet unmet.

Over the past two decades great effort has been invested in meeting this challenge using an impressive array of
different mathematical tools and approaches to harvest and understand network-wide views of traffic in the form of
sampled flow data1. One of these techniques is Random Matrix Theory (RMT).

Random matrices were introduced in Nuclear Physics to model the spectra of heavy nuclei. Since its
introduction, RMT has been used to investigate ultrasonic resonances of structural materials, chaotic systems, the
zeros of the Riemann and other zeta functions, and, more generally, sufficiently complicated systems2.

Barthélemy et al.3 showed that RMT can be used to detect correlations among different origin-destination flows.
Here  we  argue  that  the  correlations  that  are  detected  can  be  used  to  discover  network-wide  traffic  anomalies. 


2.  Overview  of  Random  Matrix  Theory 

Let $A$ be a matrix of size $N \times L$, $L > N$, with entries $a_{ij}$ that are i.i.d. random variables drawn from a probability
distribution with zero mean and variance $\sigma^{2}$. Such a matrix is called a random matrix.

The most common case is that where the $a_{ij}$ are drawn from a zero-mean Gaussian distribution with variance $\sigma^{2}$.
Consider now the covariance or Wishart matrix $R$ of size $N \times N$


$$ R = \frac{1}{L} A A^{T}. \qquad (1) $$


A  fundamental  result  of  RMT  is  that  the  eigenvalues  of  R  are  distributed  according  to  the  probability  density 
function  (pdf) 


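$$ P_{\mathrm{rmt}}(\lambda) = \frac{Q}{2\pi\sigma^{2}} \, \frac{\sqrt{(\lambda_{+}-\lambda)(\lambda-\lambda_{-})}}{\lambda}, \qquad (2) $$

where

$$ \lambda_{\pm} = \sigma^{2}\left(1 + \frac{1}{Q} \pm 2\sqrt{\frac{1}{Q}}\right) \qquad (3) $$

and

$$ Q = \lim_{N \to \infty,\; L \to \infty} \frac{L}{N}. \qquad (4) $$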


In general, further conditions are needed for the limit in eq. (4) to exist and to determine its value. For the rest of this paper
we will assume4 that this limiting value is simply $L/N$.

Note that the requirement $L > N$ in the definition of $A$ implies that $Q > 1$. Hereafter, and without loss of
generality, we assume that $\sigma = 1$.
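
The spectral edges of eq. (3) and the pdf of eq. (2) are simple to evaluate numerically. The following is a minimal sketch in Python, assuming NumPy is available; the function names mp_bounds and mp_pdf are ours and purely illustrative, and the density is taken to be zero outside the range between the two edges.

```python
import numpy as np


def mp_bounds(Q, sigma=1.0):
    """Spectral edges (lambda_minus, lambda_plus) of eq. (3)."""
    half_width = 2.0 * np.sqrt(1.0 / Q)
    lam_minus = sigma**2 * (1.0 + 1.0 / Q - half_width)
    lam_plus = sigma**2 * (1.0 + 1.0 / Q + half_width)
    return lam_minus, lam_plus


def mp_pdf(lam, Q, sigma=1.0):
    """Eigenvalue pdf of eq. (2); assumed zero outside [lambda_minus, lambda_plus]."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    lam_minus, lam_plus = mp_bounds(Q, sigma)
    pdf = np.zeros_like(lam)
    inside = (lam > lam_minus) & (lam < lam_plus)
    x = lam[inside]
    pdf[inside] = (
        Q / (2.0 * np.pi * sigma**2)
        * np.sqrt((lam_plus - x) * (x - lam_minus)) / x
    )
    return pdf


# Example: edges for Q = 2 and sigma = 1.
print(mp_bounds(2.0))
```

For $Q > 1$ the density integrates to one over the range between the edges, which provides a quick sanity check of any implementation.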

Even though eq. (2) was first derived for the case of a matrix with random elements from a zero-mean Gaussian
distribution, recent results5 show that eq. (2) also applies when the $a_{ij}$ are randomly drawn from distributions that
satisfy

$$ E\, a_{ij} = 0, \qquad (5a) $$

$$ E\, a_{ij}^{2} = 1, \qquad (5b) $$

where $E$ stands for the expectation value.

Figure 1 is a plot of eq. (2) for $N = 100$ and different values of $L$. The key feature that makes this result useful
is that if $A$ is truly a random matrix with elements that satisfy the conditions in eq. (5), then the eigenvalues of its
covariance matrix are in the range $[\lambda_{-}, \lambda_{+}]$ and follow the pdf in eq. (2). For finite $N$ and $L$, i.e., away from the
limit in eq. (4), the range is only approximate4.


Fig. 1. The RMT eigenvalue probability density function, eq. (2), for N = 100.


Why  is  this  result  useful?  Suppose  that  we  have  N  variables,  and  that  each  of  them  is  observed  L  times.  We 
would  like  to  know  whether  the  data  at  hand  is  random;  more  precisely,  we  would  like  to  know  whether  the 
observations  are  uncorrelated. 

The null hypothesis is then that the $N \times L$ observations are indistinguishable from $N \times L$ i.i.d. random variables
drawn from a distribution with zero mean and unit variance. To answer this question using RMT, define the row
vector $x_i$ of length $L$ that contains the observations of variable $i$, i.e., $x_{ij}$ is the $j$th observation of variable $i$.
Assume that $x_i$ is standardized, i.e., it has zero mean and unit standard deviation. In the next step, form the
matrix $X$ using for its rows the vectors $x_i$, $i = 1, \ldots, N$; compute its covariance (or Wishart) matrix

$$ C = \frac{1}{L} X X^{T}, \qquad (6) $$

and finally compute the eigenvalues of $C$.

If the eigenvalues of $C$ are distributed according to $P_{\mathrm{rmt}}(\lambda)$, the pdf in eq. (2), then the data are consistent with
random variables drawn from one or more distributions satisfying eq. (5). Otherwise, we can discover correlations in the data by considering the
eigenvalues outside the range $[\lambda_{-}, \lambda_{+}]$ and analyzing the corresponding eigenvectors. Of course, we have to keep in
mind that we are only approximating eq. (4). This approach has been used successfully to analyze financial data6
and Internet traffic3.
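
The test described above can be sketched in a few lines. The code below is an illustration only, assuming NumPy; the helper name eigenvalues_outside_rmt is hypothetical, the normalization $1/L$ and the finite-size value $Q = L/N$ are taken as given, and no allowance is made for finite-size fluctuations at the edges.

```python
import numpy as np


def eigenvalues_outside_rmt(X):
    """Return (eigenvalues, (lam_minus, lam_plus), outliers) for an N x L data matrix."""
    X = np.asarray(X, dtype=float)
    N, L = X.shape
    # Standardize each row (variable) to zero mean and unit standard deviation.
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    # Covariance matrix, eq. (6).
    C = Z @ Z.T / L
    eigvals = np.linalg.eigvalsh(C)  # real eigenvalues in ascending order
    # Spectral edges, eq. (3), with sigma = 1 and Q approximated by L / N.
    Q = L / N
    lam_minus = 1.0 + 1.0 / Q - 2.0 * np.sqrt(1.0 / Q)
    lam_plus = 1.0 + 1.0 / Q + 2.0 * np.sqrt(1.0 / Q)
    outliers = eigvals[(eigvals < lam_minus) | (eigvals > lam_plus)]
    return eigvals, (lam_minus, lam_plus), outliers


# Example with uncorrelated Gaussian data (N = 100 variables, L = 400 observations).
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 400))
eigvals, (lam_minus, lam_plus), outliers = eigenvalues_outside_rmt(X)
print(f"range [{lam_minus:.3f}, {lam_plus:.3f}], eigenvalues outside: {len(outliers)}")
```

Applied to independent samples, almost all eigenvalues should fall inside the range; adding a common component to every row should push at least one eigenvalue well above the upper edge.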


3.  Application  of  RMT  to  Anomalous  Traffic  Detection 

The  application  to  anomalous  traffic  detection  is  straightforward. 

Suppose  that  we  have  a  network  with  N  flows,  and  we  wish  to  detect  traffic  anomalies;  and  suppose  that  L  time 
samples are obtained for each flow. Let $x_i$ be the vector containing the $L$ time samples for flow $i$.

As explained above, we form the matrix $X$ from the individual flows $x_i$ (we shall call the matrix $X$ the traffic
matrix) and proceed to compute its covariance matrix $C$ as in eq. (6). However, an additional step is required
here: instead of looking at the vector $x_i$, it is better to consider the logarithm of the ratio of successive elements of
$x_i$ along the time dimension, i.e., we use

$$ \hat{x}_{ij} = \ln x_{i,j+1} - \ln x_{ij}, \quad j = 1, \ldots, L-1. \qquad (7) $$

The rationale for using $\hat{x}_i$ instead of $x_i$ is the removal of any multiplicative biases in the time series.

Next, each $\hat{x}_i$ is normalized to have zero mean and unit variance:

$$ z_i = \frac{\hat{x}_i - \langle \hat{x}_i \rangle}{\sqrt{\langle \hat{x}_i^{2} \rangle - \langle \hat{x}_i \rangle^{2}}}, \qquad (8) $$

where $\langle \cdot \rangle$ indicates a sample average. Thus, the matrix $Z$ formed from the $z_i$ has dimensions $N \times (L-1)$,
while its covariance matrix $\tilde{C}$ has the same dimensions as $C$, namely $N \times N$.
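
The preprocessing of eqs. (7) and (8) can be sketched as follows. This is a minimal illustration assuming NumPy and strictly positive traffic values (so that the logarithm is defined); the function name normalized_log_changes is hypothetical.

```python
import numpy as np


def normalized_log_changes(X):
    """Map an N x L traffic matrix to the N x (L-1) matrix of standardized log changes."""
    X = np.asarray(X, dtype=float)
    # Eq. (7): logarithm of the ratio of successive samples along the time axis.
    log_changes = np.diff(np.log(X), axis=1)
    # Eq. (8): subtract the sample mean and divide by the sample standard deviation, row by row.
    mean = log_changes.mean(axis=1, keepdims=True)
    std = log_changes.std(axis=1, keepdims=True)
    return (log_changes - mean) / std


# Example: five synthetic positive flows with fifty samples each.
rng = np.random.default_rng(2)
X = np.exp(np.cumsum(0.1 * rng.standard_normal((5, 50)), axis=1))
Z = normalized_log_changes(X)
print(Z.shape)
```

Because a constant multiplicative factor in a flow only adds a constant to its logarithm, it cancels in the difference, so rescaling any row of the input leaves the output unchanged.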

Here we present two examples of this application: first using a simulated traffic matrix, and then using
real network traffic.

3.1. Simulated Network Traffic

The  purpose  of  this  simulation  is  to  create  traffic  that  is  uncorrelated,  and  then  artificially  insert  correlations 
between  the  different  flows  to  test  whether  RMT  can  detect  these  anomalous  correlations. 

In principle, given M nodes and N edges, a realistic simulation should take into account the traffic created by
each of the nodes, the way this traffic is routed, and the time-dependent behavior of the different protocols that
handle this traffic7. However, a reasonable shortcut is based on the finding8 that Internet flows can be modeled as
$\alpha$-stable9 flows.

We used the Matlab routine stabrnd to generate 100 $\alpha$-stable flows ($N = 100$), each consisting of 200 samples
($L = 200$). For each flow, we chose the distribution parameters $\alpha$ and $\beta$ at random, with uniform probability, from
$\alpha \in [0.5, 1.8]$ and $\beta \in [-0.5, 0.5]$. Note that we simulated the values of the logarithmic changes $\hat{x}_i$ of eq. (7),
not the flows $x_i$ themselves. Furthermore, the values of $\hat{x}_i$ thus obtained were limited to $\hat{x}_{ij} \in [-20, 20]$, i.e., we
limited the range of the logarithmic changes to one compatible with observed values.

Next, we computed the matrix $Z$, its covariance matrix, and the corresponding eigenvalues. Figure 2a is a plot of the
eigenvalue distribution and a comparison with the eigenvalue distribution predicted by RMT, eqs. (2)-(4) with
$Q = 1.99$.
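
The quoted value of $Q$ and the resulting spectral edges can be checked by hand. Assuming that $Q$ is taken as the ratio of the dimensions of $Z$, i.e., $(L-1)/N$ with $L = 200$ and $N = 100$, and that $\sigma = 1$, eq. (3) gives

$$ Q = \frac{L-1}{N} = \frac{199}{100} = 1.99, \qquad \lambda_{\pm} = 1 + \frac{1}{1.99} \pm \frac{2}{\sqrt{1.99}} \approx 1.503 \pm 1.418, $$

so that $\lambda_{-} \approx 0.085$ and $\lambda_{+} \approx 2.92$.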


Fig. 2. (a) Comparison of the pdf in eq. (2) with that for our simulation with $\alpha$-stable flows; (b) the injected anomaly.


Next, we artificially introduced correlations in the simulated data to test the detection capabilities of RMT. To do
so, we randomly chose ten flows and a time interval with forty points, and added the anomaly in Fig. 2b; because the
flows are normalized to unit variance, the anomaly amplitude was set to 2.5 to obtain enough of a signal. The
resulting anomalous flows can be seen in Fig. 3.


Fig. 3. (a) The ten flows with the injected anomaly, plotted against time bin. The thick black line is the mean value; (b) zoomed view of (a).

We computed the eigenvalues of the covariance matrix of $Z$, the matrix with the anomalous flows, and compared their
distribution to that predicted by RMT. The comparison can be seen in Fig. 4.

Fig. 4. (a) Deviation from the RMT eigenvalue pdf after anomaly injection; (b) eigenvector components for the largest eigenvalue.


Note that the anomalies introduced as described above appear in Fig. 4a as an outlying eigenvalue $\lambda = 3.3$, while
RMT predicts an upper edge $\lambda_{+} = 2.9$. However, these results raise a question: which of the flows are correlated? Fig. 3 shows
that  even  if  we  knew  which  flows  were  anomalous,  it  is  difficult  to  detect  the  anomaly  by  eye  (this  is  why  we 
plotted  the  mean  value  to  guide  the  eye). 

The answer is found by looking at the eigenvector of the eigenvalue under consideration, in this case the largest
one. According to RMT, if the components $u_{ij}$ of eigenvector $u_i$ are normalized such that $\sum_{j=1}^{N} u_{ij}^{2} = N$, then they should
be distributed as a Gaussian pdf with zero mean and unit variance; in the RMT literature this distribution of
eigencomponents is known as the Porter-Thomas distribution.

In  our  case,  however,  the  components  of  the  eigenvectors  that  correspond  to  the  eigenvalues  that  deviate  from 
RMT  do  not  obey  the  Porter-Thomas  distribution;  therefore,  by  looking  at  the  pdf  of  the  eigenvector  components 
and  their  values  we  can  identify  the  flows  that  are  correlated.  This  is  demonstrated  in  Fig.  4b  for  the  largest 
eigenvalue: those components with value greater than approximately 2 are anomalous.

In  Fig.  5a  we  plotted  the  absolute  value  of  the  eigencomponents  for  the  largest  eigenvalue,  and  two  horizontal 
lines  that  identify  the  significance  level  p  of  the  deviations  from  the  Porter-Thomas  distribution  in  Fig.  4b;  the 
thick  lines  identify  the  location  of  the  anomalous  flows.  Note  that  at  a  significance  level  p  =  0.05  we  correctly 
identify  all  the  anomalous  flows,  while  at  p  =  0.01  three  out  of  ten  anomalous  flows  are  detected. 
This example illustrates the power of the technique we propose.
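
The injection experiment and the eigenvector test can be imitated with a short script. The sketch below is not the simulation reported above: it assumes NumPy, replaces the $\alpha$-stable series by Gaussian surrogates, uses a hypothetical bump shape (a Hann window of amplitude 2.5), works with $Q = L/N = 2$, and flags components whose absolute value exceeds 1.96, the two-sided $p = 0.05$ level of a unit Gaussian. The name flag_correlated_flows is ours.

```python
import numpy as np


def flag_correlated_flows(Z, threshold=1.96):
    """Flag flows using the eigenvector of the largest covariance eigenvalue.

    Z is an N x L matrix whose rows are already standardized.
    """
    N, L = Z.shape
    C = Z @ Z.T / L
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    # Upper edge of eq. (3) with sigma = 1 and 1/Q = N/L.
    lam_plus = 1.0 + N / L + 2.0 * np.sqrt(N / L)
    # eigh returns unit-norm eigenvectors; rescale so the squared components sum to N.
    u = np.sqrt(N) * eigvecs[:, -1]
    flagged = np.flatnonzero(np.abs(u) > threshold)
    return eigvals[-1], lam_plus, u, flagged


rng = np.random.default_rng(0)
N, L = 100, 200
# Assumption: Gaussian surrogate series stand in for the alpha-stable flows.
series = rng.standard_normal((N, L))
# Inject one shared bump (hypothetical shape) into ten flows over forty consecutive bins.
anomalous = np.sort(rng.choice(N, size=10, replace=False))
start = int(rng.integers(0, L - 40))
series[anomalous, start:start + 40] += 2.5 * np.hanning(40)
# Standardize each row before forming the covariance matrix.
Z = (series - series.mean(axis=1, keepdims=True)) / series.std(axis=1, keepdims=True)

top, lam_plus, u, flagged = flag_correlated_flows(Z)
print(f"largest eigenvalue {top:.2f} vs RMT upper edge {lam_plus:.2f}")
print("injected:", anomalous.tolist())
print("flagged: ", flagged.tolist())
```

The sign of an eigenvector is arbitrary, which is why the comparison is made on absolute values.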

Fig. 5. (a) Eigenvector components for the largest eigenvalue (#100); (b) eigenvalue distribution for the Abilene data.


3.2.  Real  Network  Traffic 

To test our methodology we used one of the Abilene network datasets. The Abilene network was a high-performance
backbone network created by the Internet2 community in the late 1990s. In 2007 the Abilene network
was retired and the upgraded network became known as the “Internet2 network.” Abilene was a private network
used for education and research, but was not entirely isolated, since its members usually provided alternative access
to many of their resources through the public Internet. The network backbone consisted of 11 points of presence
(PoPs).

The dataset used here consists of the 121 origin-destination (OD) flows between the 11 PoPs; note that flows
originating and ending in the same PoP are included. It is the same dataset used by Lakhina et al.10; these authors collected
three weeks of sampled IP-level traffic flow data from every PoP in Abilene for the period 8 December 2003 to 28
December 2003. Sampling was periodic, at a rate of 1 out of 100 packets. The network reported flow statistics every
5 minutes, and therefore the data are binned in 5-minute bins.

We used the RMT approach on the first week of data (2016 time samples, i.e., 7 days of 288 five-minute bins each). Fig. 5b shows the eigenvalue
distribution and a comparison with that predicted by RMT.

Next, we examined the components of the eigenvector corresponding to the largest eigenvalue in Fig. 5b. The
absolute values of these components are plotted in Fig. 6a. Note the very large values for components 45-55.


Fig.  6.  (a)  Eigenvector  components  for  the  largest  eigenvalue  (#121);  (b)  time  series  for  flows  45-55. 


The very large values for flows 45-55 in Fig. 6a point to a strong correlation between these flows, and we
proceeded to examine the corresponding time series; these are plotted in Fig. 6b. These results show the ability of
RMT to detect an anomaly: there was an outage at the Indianapolis node (the point of origin of flows 45-55) during
the first week of data. While the outage lasted, it caused a perfect correlation between the OD flows originating at
the Indianapolis node, and this correlation resulted in a very large eigenvalue, much larger than that predicted by
RMT.

4.  Natural  Language  Processing 

Natural  Language  Processing  is  concerned  with  the  interactions  between  computers  and  human  (natural) 
languages.  An  important  example  is  the  analysis  of  relationships  between  a  set  of  documents  (a  corpus)  and  the 
terms  they  contain  by  producing  a  set  of  concepts  related  to  the  documents  and  terms.  This  type  of  semantic  analysis 
posits  that  each  document  is  a  mixture  of  a  small  number  of  topics  and  that  the  occurrence  of  every  word  is 
attributable  to  one  of  the  document's  topics.  The  goal  is,  then,  to  identify  the  topics  in  a  corpus. 

In  particular,  Latent  Semantic  Analysis11,12  (LSA)  is  a  very  important,  zero-knowledge  technique.  The  approach 
of  LSA  is  to  first  build  a  matrix,  the  so-called  occurrence  matrix,  that  is  formed  of  word  counts  per  document  (rows 
represent  unique  words  and  columns  represent  each  document)  constructed  from  a  corpus;  and  then  to  look  for 
correlations  between  the  words  by  using  Singular  Value  Decomposition  (SVD)  to  reduce  the  number  of  columns 
while  preserving  the  similarity  structure  among  rows. 

A key step in LSA is dimensionality reduction of the problem; this is achieved by selecting the $k$ largest
singular values obtained by SVD. An open problem11,12 is how to choose the cut-off rank $k_0$ below which the
singular values are discarded.

We propose the use of RMT to identify $k_0$, assuming that the elements of the occurrence matrix are drawn from
a distribution that satisfies eq. (5); without loss of generality let us assume that they are drawn from a Gaussian
distribution with zero mean and unit variance.

To be specific, let $w_{ij}$ be the number of times that word $i$ appears in document $j$. In other words, consider the
vectors $w_i$, $i = 1, \ldots, V$, each of dimension $N$ (the number of documents in the corpus), where $V$ is the number of words
considered.

The analysis would then proceed as follows: (1) compute the eigenvalues of the correlation matrix $C$ for the
matrix $W$ formed with the vectors $w_i$ after these are normalized to have zero mean and unit variance;
(2) set $Q = N/V$ and compute the corresponding RMT eigenvalue pdf, $P_{\mathrm{rmt}}(\lambda)$ [eqs. (2)-(3)]; (3) compare the
distribution of eigenvalues of $C$ with $P_{\mathrm{rmt}}(\lambda)$.

The rank of the eigenvalues of $C$ at which the empirical distribution deviates from $P_{\mathrm{rmt}}(\lambda)$ should identify the
correct value of $k_0$ to use. Furthermore, the statistical significance of the $k_0$ thus chosen can be evaluated with the help
of the Tracy-Widom distribution4.
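
One simple reading of step (3) is to count the eigenvalues that lie above the upper edge of eq. (3). The sketch below takes that reading as an assumption and is not a tested procedure: it assumes NumPy, a $V \times N$ occurrence matrix in which no word has constant counts (otherwise the standardization divides by zero), and $\sigma = 1$. The name rmt_cutoff_rank is hypothetical, and the example uses synthetic Gaussian data with three planted topics rather than a real corpus.

```python
import numpy as np


def rmt_cutoff_rank(W):
    """Estimate the cut-off rank from a V x N occurrence matrix (rows = words)."""
    W = np.asarray(W, dtype=float)
    V, N = W.shape
    # Step (1): standardize each word vector, then form the correlation matrix.
    Z = (W - W.mean(axis=1, keepdims=True)) / W.std(axis=1, keepdims=True)
    C = Z @ Z.T / N
    eigvals = np.linalg.eigvalsh(C)[::-1]  # descending order
    # Step (2): Q = N / V and the upper edge of eq. (3) with sigma = 1.
    Q = N / V
    lam_plus = 1.0 + 1.0 / Q + 2.0 * np.sqrt(1.0 / Q)
    # Step (3), simplified: count the eigenvalues above the RMT upper edge.
    k0 = int(np.sum(eigvals > lam_plus))
    return k0, eigvals, lam_plus


# Synthetic example: 50 "words", 400 "documents", three planted topics of ten words each.
rng = np.random.default_rng(3)
V, N = 50, 400
topics = rng.standard_normal((3, N))
W = rng.standard_normal((V, N))
for t in range(3):
    W[10 * t:10 * (t + 1)] += 1.5 * topics[t]
k0, eigvals, lam_plus = rmt_cutoff_rank(W)
print(f"upper edge {lam_plus:.3f}, estimated cut-off rank {k0}")
```

A hard threshold at the upper edge ignores finite-size fluctuations of the largest noise eigenvalue, which is where a significance test would enter.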

It might be argued that word counts in a document follow a Poisson rather than a Gaussian distribution, but note that for
a Poisson parameter $\nu > 10$ the distribution is nearly Gaussian. Otherwise, before standardizing the observations, a
variance-stabilization transformation13 can be performed. In particular, using the transformation


$$ y = \sqrt{x} - \sqrt{\nu}, $$

a Poisson-distributed variable $x$ with parameter $\nu$ is transformed into an approximately Gaussian-distributed one with zero mean and variance 1/4.
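
The variance of 1/4 can be checked numerically. The following sketch assumes NumPy and assumes that the transformation in question is the centred square-root transform $y = \sqrt{x} - \sqrt{\nu}$; the function name stabilized_moments and the chosen values of $\nu$ are illustrative only.

```python
import numpy as np


def stabilized_moments(nu, size=200_000, seed=0):
    """Sample mean and variance of sqrt(x) - sqrt(nu) for x drawn from Poisson(nu)."""
    rng = np.random.default_rng(seed)
    x = rng.poisson(nu, size=size)
    y = np.sqrt(x) - np.sqrt(nu)
    return y.mean(), y.var()


for nu in (2, 10, 50):
    mean, var = stabilized_moments(nu)
    print(f"nu = {nu:3d}: mean = {mean:+.3f}, variance = {var:.3f}")
```

The approximation is asymptotic in $\nu$, so the sample variance is expected to approach 1/4 as $\nu$ grows and to deviate from it for small $\nu$.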

To our knowledge, this technique for choosing $k_0$ has never been tried before; we intend to investigate the
practicality of this approach in the near future.

5. Summary and Future Work

We  have  shown  that  the  application  of  Random  Matrix  Theory  holds  great  promise  in  detecting  network  traffic 
anomalies  in  the  form  of  anomalous  correlations  between  origin-destination  flows. 

The  results  presented  here  are  exploratory  and  several  questions  remain  to  be  studied: 

1. We have detected a network outage using RMT. However, this kind of anomaly introduces a perfect
correlation  for  a  finite  time  span.  How  can  we  detect  less  than  perfect  correlations? 
2. Some of the correlations between flows are benign, for example, the 24-hour variations due to changes in human
activity. How can these be filtered out from the analysis?

3. What are the correlation signatures of other types of anomalies, in particular anomalies of malicious origin?

4. What is the optimal time window for analysis?

5.  So  far,  we  have  focused  on  the  largest  eigenvalue  of  the  covariance  matrix;  what  can  be  learned  from  the 
second-largest  and  subsequent-largest  eigenvalues? 

6.  Last,  but  not  least,  in  our  analysis  we  used  the  OD  flows  instead  of  the  raw  packet  data.  Can  we  apply  RMT  to 
the  raw  data?  This  question  is  relevant  for  the  implementation  of  automatic,  real-time  systems  for  anomaly 
detection. 

As for Natural Language Processing, we have proposed a new method based on RMT to reduce the
dimensionality of the problem when tackled with LSA. A still-to-be-proven advantage of our proposal is the
systematic way to choose the cut-off rank for the relevant dimensions.